Anik Chakraborty
Problem Statement:
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and flip them at a higher price. For this purpose, the company has collected a data set from house sales in Australia, provided in the CSV file below. The company is looking at prospective properties to buy in order to enter the market. You are required to build a regression model using regularisation to predict the actual value of the prospective properties and decide whether to invest in them or not. The company wants to know the following things about the prospective properties:
## Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sweetviz as sv
import warnings
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
warnings.filterwarnings('ignore')
# Reading data from csv
housing_df= pd.read_csv(r'C:\Users\wayto\Desktop\housing\train.csv')
housing_df.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Checking shape
housing_df.shape
(1460, 81)
# Checking dataframe info
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     1452 non-null   object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
# Checking descriptive statistics
housing_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
| BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
| BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
| TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
| 1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
| 2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
| LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
| GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
| BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
| BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
| FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
| HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
| KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
| TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
| Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
| GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.00 | 1980.0 | 2002.00 | 2010.0 |
| GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
| GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
| WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
| OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
| EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
| 3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
| ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
| PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
| MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
| MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
| YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
| SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
## Checking percentage of missing values
missing_info= round(housing_df.isna().sum() * 100/housing_df.shape[0], 2)
missing_info[missing_info > 0].sort_values(ascending= False)
PoolQC          99.52
MiscFeature     96.30
Alley           93.77
Fence           80.75
FireplaceQu     47.26
LotFrontage     17.74
GarageYrBlt      5.55
GarageType       5.55
GarageFinish     5.55
GarageQual       5.55
GarageCond       5.55
BsmtFinType2     2.60
BsmtExposure     2.60
BsmtFinType1     2.53
BsmtCond         2.53
BsmtQual         2.53
MasVnrArea       0.55
MasVnrType       0.55
Electrical       0.07
dtype: float64
# Getting column names having missing values
missing_val_cols= missing_info[missing_info > 0].sort_values(ascending= False).index
missing_val_cols
Index(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage',
'GarageYrBlt', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual',
'MasVnrArea', 'MasVnrType', 'Electrical'],
dtype='object')
# Checking unique values in these columns
for col in missing_val_cols:
    print('\nColumn Name:', col)
    print(housing_df[col].value_counts(dropna= False))
Column Name: PoolQC
NaN 1453
Gd 3
Ex 2
Fa 2
Name: PoolQC, dtype: int64
Column Name: MiscFeature
NaN 1406
Shed 49
Gar2 2
Othr 2
TenC 1
Name: MiscFeature, dtype: int64
Column Name: Alley
NaN 1369
Grvl 50
Pave 41
Name: Alley, dtype: int64
Column Name: Fence
NaN 1179
MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64
Column Name: FireplaceQu
NaN 690
Gd 380
TA 313
Fa 33
Ex 24
Po 20
Name: FireplaceQu, dtype: int64
Column Name: LotFrontage
NaN 259
60.0 143
70.0 70
80.0 69
50.0 57
...
106.0 1
38.0 1
138.0 1
140.0 1
137.0 1
Name: LotFrontage, Length: 111, dtype: int64
Column Name: GarageYrBlt
NaN 81
2005.0 65
2006.0 59
2004.0 53
2003.0 50
..
1906.0 1
1927.0 1
1900.0 1
1908.0 1
1933.0 1
Name: GarageYrBlt, Length: 98, dtype: int64
Column Name: GarageType
Attchd 870
Detchd 387
BuiltIn 88
NaN 81
Basment 19
CarPort 9
2Types 6
Name: GarageType, dtype: int64
Column Name: GarageFinish
Unf 605
RFn 422
Fin 352
NaN 81
Name: GarageFinish, dtype: int64
Column Name: GarageQual
TA 1311
NaN 81
Fa 48
Gd 14
Ex 3
Po 3
Name: GarageQual, dtype: int64
Column Name: GarageCond
TA 1326
NaN 81
Fa 35
Gd 9
Po 7
Ex 2
Name: GarageCond, dtype: int64
Column Name: BsmtFinType2
Unf 1256
Rec 54
LwQ 46
NaN 38
BLQ 33
ALQ 19
GLQ 14
Name: BsmtFinType2, dtype: int64
Column Name: BsmtExposure
No 953
Av 221
Gd 134
Mn 114
NaN 38
Name: BsmtExposure, dtype: int64
Column Name: BsmtFinType1
Unf 430
GLQ 418
ALQ 220
BLQ 148
Rec 133
LwQ 74
NaN 37
Name: BsmtFinType1, dtype: int64
Column Name: BsmtCond
TA 1311
Gd 65
Fa 45
NaN 37
Po 2
Name: BsmtCond, dtype: int64
Column Name: BsmtQual
TA 649
Gd 618
Ex 121
NaN 37
Fa 35
Name: BsmtQual, dtype: int64
Column Name: MasVnrArea
0.0 861
72.0 8
180.0 8
NaN 8
108.0 8
...
337.0 1
415.0 1
293.0 1
259.0 1
621.0 1
Name: MasVnrArea, Length: 328, dtype: int64
Column Name: MasVnrType
None 864
BrkFace 445
Stone 128
BrkCmn 15
NaN 8
Name: MasVnrType, dtype: int64
Column Name: Electrical
SBrkr 1334
FuseA 94
FuseF 27
FuseP 3
Mix 1
NaN 1
Name: Electrical, dtype: int64
For the following attributes, NaN simply means the feature is not present in the house: 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual'.
So here, we will replace NaN values for the above attributes with 'Not Present'.
The remaining columns with genuinely missing values are: LotFrontage, GarageYrBlt, MasVnrArea, MasVnrType, Electrical.
# Replacing NaN with 'Not Present' for below columns
valid_nan_cols= ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual']
housing_df[valid_nan_cols]= housing_df[valid_nan_cols].fillna('Not Present')
# Checking percentage of missing values again
missing_info= round(housing_df.isna().sum() * 100/housing_df.shape[0], 2)
missing_info[missing_info > 0].sort_values(ascending= False)
LotFrontage    17.74
GarageYrBlt     5.55
MasVnrArea      0.55
MasVnrType      0.55
Electrical      0.07
dtype: float64
# Checking if there is any relation between GarageYrBlt and GarageType
housing_df[housing_df.GarageYrBlt.isna()]['GarageType'].value_counts(normalize= True)
Not Present    1.0
Name: GarageType, dtype: float64
Initially GarageYrBlt and GarageType both had 5.55% missing values. After imputing the NaN values of GarageType with 'Not Present', we can see that GarageYrBlt is NaN only for those observations where GarageType is 'Not Present'. We can conclude that if no garage is present, there is no GarageYrBlt value for that house either. So we can safely impute the GarageYrBlt NaN values with 0.
# Imputing missing values of GarageYrBlt column
housing_df['GarageYrBlt']= housing_df['GarageYrBlt'].fillna(0)
I'll perform statistical imputation for the rest of the columns (LotFrontage, MasVnrArea, MasVnrType, Electrical) after the train-test split, so that the fill values are computed from the training data only and no information leaks from the test set.
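The leakage-free pattern (compute statistics on the train rows only, then apply them to every row) can be sketched on a toy frame. The column names are borrowed from the dataset, but the values below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the housing data:
# one numeric and one categorical column, each with a missing entry.
df = pd.DataFrame({
    'LotFrontage': [60.0, 80.0, np.nan, 70.0],
    'Electrical':  ['SBrkr', 'FuseA', 'SBrkr', np.nan],
})
train = df.iloc[:3]   # pretend split: the first 3 rows are "train"

# Statistics come from the train slice only, then fill the whole frame,
# so nothing from the held-out rows influences the fill values.
lot_median = train['LotFrontage'].median()   # median of [60, 80] -> 70.0
elec_mode  = train['Electrical'].mode()[0]   # most frequent -> 'SBrkr'

df['LotFrontage'] = df['LotFrontage'].fillna(lot_median)
df['Electrical']  = df['Electrical'].fillna(elec_mode)
```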
MSSubClass ("identifies the type of dwelling involved in the sale") is a categorical variable, but it appears in the data as a numeric one.
# Changing data type of MSSubClass
housing_df['MSSubClass']= housing_df['MSSubClass'].astype('object')
There are 81 attributes in the dataset, so I am running SweetViz AutoEDA to explore and visualize the data. Then I'll manually explore the attributes that have a high correlation coefficient with the target variable.
# Running SweetViz AutoEDA
sv_report= sv.analyze(housing_df)
sv_report.show_notebook()
# Saving the report as html
sv_report.show_html(r'C:\Users\wayto\Desktop\housing\AutoEDA_report.html')
Report C:\Users\wayto\Desktop\housing\AutoEDA_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
# Checking distribution of SalePrice
sns.distplot(housing_df['SalePrice'])
<AxesSubplot:xlabel='SalePrice', ylabel='Density'>
# Plotting numeric variables against SalePrice
numeric_cols= ['GrLivArea','GarageArea','TotalBsmtSF','1stFlrSF','TotRmsAbvGrd','YearBuilt','YearRemodAdd','MasVnrArea',
'BsmtFinSF1','LotFrontage','WoodDeckSF','2ndFlrSF','OpenPorchSF','LotArea']
sns.pairplot(housing_df, x_vars=['GrLivArea','GarageArea','TotalBsmtSF','1stFlrSF','TotRmsAbvGrd'], y_vars='SalePrice', kind= 'reg', plot_kws={'line_kws':{'color':'teal'}})
sns.pairplot(housing_df, x_vars=['YearBuilt','YearRemodAdd','MasVnrArea','BsmtFinSF1','LotFrontage'], y_vars='SalePrice', kind= 'reg', plot_kws={'line_kws':{'color':'teal'}})
sns.pairplot(housing_df, x_vars=['WoodDeckSF','2ndFlrSF','OpenPorchSF','LotArea'], y_vars='SalePrice', kind= 'reg', plot_kws={'line_kws':{'color':'teal'}})
<seaborn.axisgrid.PairGrid at 0x23a1410d070>
# Box plots of categorical/ordinal features against SalePrice
cat_cols= ['OverallQual','GarageCars','ExterQual','BsmtQual','KitchenQual','FullBath','GarageFinish','FireplaceQu','Foundation','GarageType','Fireplaces','BsmtFinType1','HeatingQC']
plt.figure(figsize=[18, 40])
for i, col in enumerate(cat_cols, 1):
    plt.subplot(7, 2, i)
    title_text= f'Box plot {col} vs SalePrice'
    x_label= f'{col}'
    fig= sns.boxplot(data= housing_df, x= col, y= 'SalePrice', palette= 'Greens')
    fig.set_title(title_text, fontdict= { 'fontsize': 18, 'color': 'Green'})
    fig.set_xlabel(x_label, fontdict= {'fontsize': 12, 'color': 'Brown'})
plt.show()
plt.figure(figsize=[17,7])
sns.boxplot(data= housing_df, x= 'Neighborhood', y= 'SalePrice', palette= 'Greens')
plt.show()
SalePrice is right skewed, and the other numeric features ('GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'LotFrontage', 'WoodDeckSF', '2ndFlrSF', 'OpenPorchSF', 'LotArea') have outliers; they all show a somewhat linear relationship with SalePrice.
Median SalePrice is higher for houses with a higher OverallQual rating. Houses with excellent exterior material quality have the highest prices, and price falls as the quality grade decreases.
# Creating correlation heatmap
plt.figure(figsize = (20, 12))
sns.heatmap(housing_df.corr(), annot= True, cmap= 'coolwarm', fmt= '.2f', vmin= -1, vmax= 1)
plt.show()
The heatmap shows some feature pairs with very high correlation coefficients: GarageCars with GarageArea, and TotRmsAbvGrd with GrLivArea. One feature from each pair can be dropped to reduce multicollinearity.
# Dropping GarageCars and TotRmsAbvGrd
housing_df.drop(['GarageCars','TotRmsAbvGrd'], axis= 1, inplace= True)
housing_df.shape
(1460, 79)
housing_df_org= housing_df.copy()
We have already seen that our target variable SalePrice is heavily right skewed. We can perform a log transformation to remove the skewness, which should help boost model performance.
# Distplot of log transformed SalePrice
sns.distplot(np.log(housing_df['SalePrice']))
plt.show()
It can be seen that after the log transformation, SalePrice has a near-normal distribution.
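The same effect can be checked numerically. A minimal sketch on synthetic log-normal "prices" (not the actual SalePrice column) shows the sample skewness dropping to near zero after the log transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed prices (log-normal), standing in for SalePrice.
prices = pd.Series(np.exp(rng.normal(loc=12, scale=0.4, size=1460)))

skew_before = prices.skew()          # strongly positive (long right tail)
skew_after  = np.log(prices).skew()  # close to 0 after the log transform
```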
# Transforming 'SalePrice'
housing_df['SalePrice_log_trans']= np.log(housing_df['SalePrice'])
Now dropping SalePrice, as we have created its log-transformed version. Also dropping the Id column, as it will not help in prediction.
# Dropping ID Column and SalePrice
housing_df.drop(['SalePrice','Id'], axis=1, inplace= True)
housing_df.shape
(1460, 78)
# Train-Test Split
y= housing_df['SalePrice_log_trans']
X= housing_df.drop('SalePrice_log_trans', axis= 1)
X_train, X_test, y_train, y_test= train_test_split(X, y, train_size= .7, random_state= 42)
# Getting index values of train test dataset
train_index= X_train.index
test_index= X_test.index
Imputing the remaining features in the train and test datasets using the median (for continuous variables) and mode (for categorical variables) calculated on the train dataset.
# Performing statistical imputation for missing values in the LotFrontage, MasVnrArea, MasVnrType
# and Electrical columns, using statistics computed on the training rows only
housing_df['LotFrontage'].fillna(X_train['LotFrontage'].median(), inplace= True)
housing_df['MasVnrArea'].fillna(X_train['MasVnrArea'].median(), inplace= True)
# .mode() returns a Series, so take its first element as the fill value
housing_df['MasVnrType'].fillna(X_train['MasVnrType'].mode()[0], inplace= True)
housing_df['Electrical'].fillna(X_train['Electrical'].mode()[0], inplace= True)
# Getting object and numeric type columns
housing_cat= housing_df.select_dtypes(include= 'object')
housing_num= housing_df.select_dtypes(exclude= 'object')
housing_cat.describe()
| | MSSubClass | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
| unique | 15 | 5 | 2 | 3 | 4 | 4 | 2 | 5 | 3 | 25 | ... | 7 | 4 | 6 | 6 | 3 | 4 | 5 | 5 | 9 | 6 |
| top | 20 | RL | Pave | Not Present | Reg | Lvl | AllPub | Inside | Gtl | NAmes | ... | Attchd | Unf | TA | TA | Y | Not Present | Not Present | Not Present | WD | Normal |
| freq | 536 | 1151 | 1454 | 1369 | 925 | 1311 | 1459 | 1052 | 1382 | 225 | ... | 870 | 605 | 1311 | 1326 | 1340 | 1453 | 1179 | 1406 | 1267 | 1198 |
4 rows × 44 columns
# 'Street', 'Utilities' and 'CentralAir' have only 2 unique values each, so we encode them as 0/1
housing_df['Street']= housing_df.Street.map(lambda x: 1 if x== 'Pave' else 0)
housing_df['Utilities']= housing_df.Utilities.map(lambda x: 1 if x== 'AllPub' else 0)
housing_df['CentralAir']= housing_df.CentralAir.map(lambda x: 1 if x== 'Y' else 0)
For the rest of the categorical (nominal) columns, one-hot encoding will be used.
# Performing get_dummies
cat_cols= housing_cat.columns.tolist()
done_encoding= ['Street','Utilities', 'CentralAir']
cat_cols= [col for col in cat_cols if col not in done_encoding]
dummies= pd.get_dummies(housing_df[cat_cols], drop_first=True)
# Checking all dummies
dummies.head()
| | MSSubClass_30 | MSSubClass_40 | MSSubClass_45 | MSSubClass_50 | MSSubClass_60 | MSSubClass_70 | MSSubClass_75 | MSSubClass_80 | MSSubClass_85 | MSSubClass_90 | ... | SaleType_ConLI | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 234 columns
# Concatenating dummies with the housing_df dataframe and dropping the original features
print('housing_df before dropping original variables', housing_df.shape)
print('shape of dummies dataframe', dummies.shape)
housing_df.drop(cat_cols, axis=1, inplace= True)
housing_df= pd.concat([housing_df, dummies], axis= 1)
print('final shape of housing_df', housing_df.shape)
housing_df before dropping original variables (1460, 78)
shape of dummies dataframe (1460, 234)
final shape of housing_df (1460, 271)
During EDA we observed a few outliers in the numeric features. So we use robust scaling, which is based on the median and quantile range, instead of standard scaling, which is based on the mean and standard deviation.
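The robustness comes from the formula (x − median) / (Q_high − Q_low): a single extreme value barely moves the median or the quantiles. A small sketch verifying this against scikit-learn (shown here with the default 25–75 quantile range; the notebook itself uses (2, 98)):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One extreme outlier among otherwise small values.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaler = RobustScaler(quantile_range=(25, 75))
scaled = scaler.fit_transform(x)

# Manual equivalent of what RobustScaler computes:
med = np.median(x)                      # 3.0 -- unaffected by the outlier
q25, q75 = np.percentile(x, [25, 75])   # 2.0 and 4.0
manual = (x - med) / (q75 - q25)
```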
# Re-constructing Train-test data
X_train= housing_df.iloc[train_index, :].drop('SalePrice_log_trans', axis= 1)
y_train= housing_df.iloc[train_index, :]['SalePrice_log_trans']
X_test= housing_df.iloc[test_index, :].drop('SalePrice_log_trans', axis= 1)
y_test= housing_df.iloc[test_index, :]['SalePrice_log_trans']
# Performing scaling of numeric columns in training and test dataset using RobustScaler
num_cols= housing_num.columns.tolist()
num_cols.remove('SalePrice_log_trans')
scaler= RobustScaler(quantile_range=(2, 98))
scaler.fit(X_train[num_cols])
X_train[num_cols]= scaler.transform(X_train[num_cols])
X_test[num_cols]= scaler.transform(X_test[num_cols])
# Checking scaled features
X_train[num_cols].head()
| | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1452 | -0.360825 | -0.255342 | -0.2 | 0.0 | 0.318533 | 0.186441 | 0.126743 | 0.109760 | 0.000000 | -0.290195 | ... | 0.045214 | 0.000000 | 0.003912 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.090909 | -0.50 |
| 762 | 0.020619 | -0.041372 | 0.2 | 0.0 | 0.357143 | 0.254237 | 0.000000 | -0.255872 | 0.000000 | 0.149603 | ... | 0.143361 | 0.367391 | 0.070423 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.50 |
| 932 | 0.144330 | 0.089208 | 0.6 | 0.0 | 0.328185 | 0.203390 | 0.478454 | -0.272651 | 0.000000 | 0.854362 | ... | 0.335245 | 0.000000 | 0.641628 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.272727 | -0.25 |
| 435 | -0.278351 | 0.045983 | 0.2 | 0.2 | 0.231660 | 0.033898 | 0.000000 | -0.003496 | 0.559168 | -0.248137 | ... | 0.072783 | 0.343478 | 0.133020 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.181818 | 0.25 |
| 629 | 0.123711 | -0.024995 | 0.0 | 0.0 | -0.077220 | -0.508475 | 0.410330 | 0.163591 | 0.546164 | -0.117159 | ... | 0.039700 | 0.382609 | -0.105634 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.00 |
5 rows × 33 columns
During EDA, we saw a few categorical features where only a handful of observations differ from a constant value. Removing those categorical features that have zero or close-to-zero variance.
var_t= VarianceThreshold(threshold= .003)
variance_thresh= var_t.fit(X_train)
col_ind= var_t.get_support()
# Below columns have very low variance
X_train.loc[:, ~col_ind].columns
Index(['Utilities', 'MSSubClass_40', 'LotConfig_FR3', 'Neighborhood_Blueste',
'Condition1_RRNe', 'Condition2_Feedr', 'Condition2_PosA',
'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn',
'Condition2_RRNn', 'RoofStyle_Shed', 'RoofMatl_Membran',
'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_WdShake',
'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_CBlock',
'Exterior1st_ImStucc', 'Exterior1st_Stone', 'Exterior2nd_AsphShn',
'Exterior2nd_CBlock', 'Exterior2nd_Other', 'Exterior2nd_Stone',
'ExterCond_Po', 'Foundation_Wood', 'BsmtCond_Po', 'Heating_OthW',
'Heating_Wall', 'HeatingQC_Po', 'Electrical_FuseP', 'Electrical_Mix',
'Functional_Sev', 'GarageQual_Po', 'GarageCond_Po', 'PoolQC_Fa',
'PoolQC_Gd', 'MiscFeature_Othr', 'MiscFeature_TenC', 'SaleType_Con',
'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_Oth'],
dtype='object')
# Checking the number of appearances of each category for one of the low-variance attributes
housing_df_org.Functional.value_counts()
Typ     1360
Min2      34
Min1      31
Mod       15
Maj1      14
Maj2       5
Sev        1
Name: Functional, dtype: int64
It can be seen that Functional_Sev, i.e. Functional of type 'Sev', has only one observation in the entire dataset.
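This is consistent with the variance threshold: a 0/1 dummy that is 1 in only k of n rows behaves like a Bernoulli(p) variable with p = k/n, whose variance is p(1 − p). A back-of-the-envelope check (using the 1021 training rows) shows why such a dummy falls below the 0.003 cutoff:

```python
# Variance of a 0/1 dummy column that is "hot" in k out of n rows is
# p * (1 - p) with p = k / n. Functional_Sev has a single such row.
n, k = 1021, 1
p = k / n
variance = p * (1 - p)   # about 0.00098, well under the 0.003 threshold
```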
# Removing above columns from train and test dataset
X_train= X_train.loc[:, col_ind]
X_test= X_test.loc[:, col_ind]
# Checking shape of final training dataset
X_train.shape
(1021, 226)
# Selecting few values for alpha
range1= [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
range2= list(range(2, 1001))
range1.extend(range2)
params_grid= {'alpha': range1}
# Applying Ridge and performing GridSearchCV to find optimal value of alpha (lambda)
ridge= Ridge(random_state= 42)
gcv_ridge= GridSearchCV(estimator= ridge,
                        param_grid= params_grid,
                        cv= 3,
                        scoring= 'neg_mean_absolute_error',
                        return_train_score= True,
                        n_jobs= -1,
                        verbose= 1)
gcv_ridge.fit(X_train, y_train)
Fitting 3 folds for each of 1013 candidates, totalling 3039 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 4.3s
[Parallel(n_jobs=-1)]: Done 440 tasks | elapsed: 5.9s
[Parallel(n_jobs=-1)]: Done 1440 tasks | elapsed: 9.2s
[Parallel(n_jobs=-1)]: Done 2840 tasks | elapsed: 13.6s
[Parallel(n_jobs=-1)]: Done 3039 out of 3039 | elapsed: 14.2s finished
GridSearchCV(cv=3, estimator=Ridge(random_state=42), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3,
0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, ...]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)
# Checking best estimator
gcv_ridge.best_estimator_
Ridge(alpha=8, random_state=42)
# Checking best MAE
gcv_ridge.best_score_
-0.09607872750066214
The optimal value for alpha is 8. Note that best_score_ is the negative of the cross-validated MAE, so the best mean absolute error is about 0.096 on the log-price scale.
# Fitting model using best_estimator_
ridge_model= gcv_ridge.best_estimator_
ridge_model.fit(X_train, y_train)
Ridge(alpha=8, random_state=42)
# Evaluating on training dataset
y_train_pred= ridge_model.predict(X_train)
print( 'r2 score on training dataset:', r2_score(y_train, y_train_pred))
print( 'MSE on training dataset:', mean_squared_error(y_train, y_train_pred))
print( 'RMSE on training dataset:', (mean_squared_error(y_train, y_train_pred)**.5))
print( 'MAE on training dataset:', mean_absolute_error(y_train, y_train_pred))
r2 score on training dataset: 0.9179745620613834
MSE on training dataset: 0.012726743371864514
RMSE on training dataset: 0.1128128688220653
MAE on training dataset: 0.07633282124522903
# Evaluating on testing dataset
y_test_pred= ridge_model.predict(X_test)
print( 'r2 score on testing dataset:', r2_score(y_test, y_test_pred))
print( 'MSE on testing dataset:', mean_squared_error(y_test, y_test_pred))
print( 'RMSE on testing dataset:', (mean_squared_error(y_test, y_test_pred)**.5))
print( 'MAE on testing dataset:', mean_absolute_error(y_test, y_test_pred))
r2 score on testing dataset: 0.8911807696767164
MSE on testing dataset: 0.018419435953924413
RMSE on testing dataset: 0.135718222630288
MAE on testing dataset: 0.09344961892214859
# Ridge coefficients
ridge_model.coef_
array([-2.28607286e-02, 3.26858083e-02, 7.47810052e-03, 2.23987519e-01,
1.24470159e-01, 7.66247129e-02, 7.28425086e-02, -2.90168575e-03,
-1.58408033e-02, 2.29272203e-02, 1.58771199e-02, 8.96110075e-03,
7.11408978e-02, 1.12534773e-01, 1.18302436e-01, 2.16759436e-04,
1.63084631e-01, 4.36162120e-02, 2.33380748e-03, 9.52424254e-02,
4.74081655e-02, 7.80661457e-02, -3.34123770e-02, 3.38010659e-02,
2.14287679e-02, 1.15723918e-01, 5.21515168e-02, -7.93268347e-03,
2.61177416e-02, 2.54605740e-04, 5.54512975e-02, -1.55744549e-04,
-1.84399660e-03, 1.39464545e-02, -1.13248513e-02, -9.06561216e-02,
-5.80271068e-03, 3.16893459e-03, -8.77041776e-03, 4.51004222e-02,
2.74201200e-02, -8.70330603e-03, 1.13340788e-02, -3.83624049e-03,
-2.79125874e-02, -8.23543753e-02, -2.05872467e-02, -1.25423973e-02,
3.93661719e-02, 1.40282718e-02, 4.09924250e-02, -8.98809882e-03,
-2.68928669e-02, 3.89563436e-02, 2.57005569e-02, -5.91520570e-02,
-1.71959595e-03,  8.24189840e-02,  3.38297680e-02,  6.03868407e-02,
...,
3.69434229e-02,  2.92257280e-02])
# Ridge intercept
ridge_model.intercept_
11.672552874836796
# Top 25 features by absolute coefficient in the Ridge model
ridge_coef= pd.Series(ridge_model.coef_, index= X_train.columns)
top_25_ridge= ridge_coef[abs(ridge_coef).nlargest(25).index]
top_25_ridge
OverallQual             0.223988
GrLivArea               0.163085
OverallCond             0.124470
2ndFlrSF                0.118302
GarageArea              0.115724
Neighborhood_StoneBr    0.115503
1stFlrSF                0.112535
FullBath                0.095242
Exterior1st_BrkFace     0.093817
MSSubClass_30          -0.090656
BsmtQual_TA            -0.083728
Neighborhood_MeadowV   -0.082876
LandContour_HLS         0.082419
MSSubClass_160         -0.082354
Neighborhood_NridgHt    0.081258
Neighborhood_Crawfor    0.079771
BedroomAbvGr            0.078066
YearBuilt               0.076625
Neighborhood_Edwards   -0.075696
YearRemodAdd            0.072843
KitchenQual_TA         -0.072704
Functional_Maj2        -0.072619
SaleCondition_Alloca    0.071795
CentralAir              0.071141
BldgType_Twnhs         -0.070701
dtype: float64
# Applying Lasso and performing GridSearchCV to find optimal value of alpha (lambda)
params_grid= {'alpha': range1}  # range1: the alpha grid defined earlier for the Ridge grid search
lasso= Lasso(random_state= 42)
lasso_gcv= GridSearchCV(estimator= lasso,
param_grid= params_grid,
cv= 3,
scoring= 'neg_mean_absolute_error',
return_train_score= True,
n_jobs= -1,
verbose= 1)
lasso_gcv.fit(X_train, y_train)
Fitting 3 folds for each of 1013 candidates, totalling 3039 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 1000 tasks | elapsed: 3.1s
[Parallel(n_jobs=-1)]: Done 2930 tasks | elapsed: 8.7s
[Parallel(n_jobs=-1)]: Done 3039 out of 3039 | elapsed: 9.0s finished
GridSearchCV(cv=3, estimator=Lasso(random_state=42), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3,
0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, ...]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)
# Checking best estimator
lasso_gcv.best_estimator_
Lasso(alpha=0.001, random_state=42)
# Checking best MAE
lasso_gcv.best_score_
-0.0950531289020976
The optimal value for alpha is 0.001. Next, I'll fine-tune this value by running GridSearchCV over values closer to 0.001.
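As an aside, a refined grid around a previously found best alpha can be generated rather than typed out by hand. This is a minimal sketch using a hypothetical helper `refine_alpha_grid` (not part of the notebook above), which builds a log-spaced grid spanning one order of magnitude on either side of the best value:

```python
import numpy as np

def refine_alpha_grid(best_alpha, span=10, num=15):
    """Log-spaced grid covering one order of magnitude on either
    side of a previously found best alpha (hypothetical helper)."""
    return np.geomspace(best_alpha / span, best_alpha * span, num=num)

# Refine around the coarse-search winner, alpha = 0.001
fine_grid = refine_alpha_grid(0.001)
```

A grid like `fine_grid` could then be passed to GridSearchCV as `{'alpha': fine_grid}` in place of a hand-written list.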
range3= [0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0001, .0002, .0003, .0004, .0005, .0006, .0007, .0008, .0009, .001]
params_grid= {'alpha': range3}
lasso_gcv= GridSearchCV(estimator= lasso,
param_grid= params_grid,
cv= 3,
scoring= 'neg_mean_absolute_error',
return_train_score= True,
n_jobs= -1,
verbose= 1)
lasso_gcv.fit(X_train, y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
Fitting 3 folds for each of 15 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 0.7s finished
GridSearchCV(cv=3, estimator=Lasso(random_state=42), n_jobs=-1,
param_grid={'alpha': [5e-05, 6e-05, 7e-05, 8e-05, 9e-05, 0.0001,
0.0002, 0.0003, 0.0004, 0.0005, 0.0006,
0.0007, 0.0008, 0.0009, 0.001]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)
# Checking best estimator
lasso_gcv.best_estimator_
Lasso(alpha=0.0006, random_state=42)
So for Lasso, the optimal value of alpha is 0.0006.
# Fitting model using best_estimator_
lasso_model= lasso_gcv.best_estimator_
lasso_model.fit(X_train, y_train)
Lasso(alpha=0.0006, random_state=42)
# Evaluating on training dataset
y_train_pred= lasso_model.predict(X_train)
print( 'r2 score on training dataset:', r2_score(y_train, y_train_pred))
print( 'MSE on training dataset:', mean_squared_error(y_train, y_train_pred))
print( 'RMSE on training dataset:', (mean_squared_error(y_train, y_train_pred)**.5))
print( 'MAE on training dataset:', mean_absolute_error(y_train, y_train_pred))
r2 score on training dataset: 0.9102028774244364
MSE on training dataset: 0.013932567301942209
RMSE on training dataset: 0.11803629654450452
MAE on training dataset: 0.0785380591573057
# Evaluating on testing dataset
y_test_pred= lasso_model.predict(X_test)
print( 'r2 score on testing dataset:', r2_score(y_test, y_test_pred))
print( 'MSE on testing dataset:', mean_squared_error(y_test, y_test_pred))
print( 'RMSE on testing dataset:', (mean_squared_error(y_test, y_test_pred)**.5))
print( 'MAE on testing dataset:', mean_absolute_error(y_test, y_test_pred))
r2 score on testing dataset: 0.8947392213072709
MSE on testing dataset: 0.01781710976847528
RMSE on testing dataset: 0.13348074680820182
MAE on testing dataset: 0.09142208307508749
# Checking no. of features in Ridge and Lasso models
lasso_coef= pd.Series(lasso_model.coef_, index= X_train.columns)
selected_features= len(lasso_coef[lasso_coef != 0])
print('Features selected by Lasso:', selected_features)
print('Features present in Ridge:', X_train.shape[1])
Features selected by Lasso: 116
Features present in Ridge: 226
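The drop from 226 to 116 features is the expected effect of the L1 penalty: Lasso can shrink coefficients exactly to zero, while Ridge's L2 penalty only shrinks them toward zero. A minimal synthetic sketch of this behaviour (illustrative data, not the housing set):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
# Only the first three columns carry signal; the rest are noise
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + X_demo[:, 2] \
         + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

print('Non-zero coefficients, Lasso:', int(np.sum(lasso_demo.coef_ != 0)))
print('Non-zero coefficients, Ridge:', int(np.sum(ridge_demo.coef_ != 0)))
```

Lasso zeroes out the pure-noise columns, whereas Ridge keeps a small non-zero weight on all ten.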
# Lasso intercept
lasso_model.intercept_
11.720769816445253
# Top 25 features with coefficients in Lasso model
top25_features_lasso= lasso_coef[abs(lasso_coef[lasso_coef != 0]).nlargest(25).index]
top25_features_lasso
GrLivArea                 0.377281
OverallQual               0.309806
OverallCond               0.144188
Neighborhood_StoneBr      0.136858
GarageArea                0.134759
YearBuilt                 0.113229
Exterior1st_BrkFace       0.103740
Neighborhood_NridgHt      0.100032
Neighborhood_Crawfor      0.096631
MSSubClass_30            -0.095879
BldgType_Twnhs           -0.086798
BsmtQual_Not Present     -0.080171
FullBath                  0.078853
LandContour_HLS           0.074670
YearRemodAdd              0.073642
CentralAir                0.071640
GarageType_Not Present   -0.071603
BldgType_TwnhsE          -0.071167
Neighborhood_NoRidge      0.068999
LotShape_IR3             -0.063465
Neighborhood_Somerst      0.061472
BsmtExposure_Gd           0.059688
BedroomAbvGr              0.058771
ScreenPorch               0.057452
MSSubClass_160           -0.057430
dtype: float64
Lasso and Ridge achieve similar r2 scores and MAE on the test dataset, but Lasso has eliminated 110 features, leaving 116 in the final model, whereas Ridge keeps all 226. So the Lasso model is simpler while matching Ridge's r2 score and MAE.
Considering the above points, we choose the Lasso Regression model as our final model.
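The per-split metric printouts used above could be wrapped in a small helper to avoid repetition. A minimal sketch, assuming a hypothetical `summarize` function and toy values for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def summarize(y_true, y_pred, label):
    """Print and return the metrics reported above for one split
    (hypothetical helper, not part of the notebook)."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {'r2': r2_score(y_true, y_pred),
               'mse': mse,
               'rmse': mse ** 0.5,
               'mae': mean_absolute_error(y_true, y_pred)}
    print(f'r2 score on {label} dataset:', metrics['r2'])
    print(f'MSE on {label} dataset:', metrics['mse'])
    print(f'RMSE on {label} dataset:', metrics['rmse'])
    print(f'MAE on {label} dataset:', metrics['mae'])
    return metrics

# Illustrative call on toy values
metrics = summarize(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2]), 'toy')
```

In the notebook, this would be called as `summarize(y_train, y_train_pred, 'training')` and `summarize(y_test, y_test_pred, 'testing')`.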
# Plotting top 25 features
plt.figure(figsize= (7, 5))
top25_features_lasso.plot.barh(color= (top25_features_lasso > 0).map({True: 'g', False: 'r'}))
plt.show()
## Doubling the optimal value of alpha in Ridge
ridge2= Ridge(alpha= 16, random_state= 42)  # optimal alpha for Ridge was 8; doubled to 16
ridge2.fit(X_train, y_train)
Ridge(alpha=16, random_state=42)
# Top 10 features with double the value of optimal alpha in Ridge
ridge_coef2= pd.Series(ridge2.coef_, index= X_train.columns)
top10_ridge2= ridge_coef2[abs(ridge_coef2).nlargest(10).index]
top10_ridge2
OverallQual             0.205139
GrLivArea               0.145057
OverallCond             0.111117
GarageArea              0.106036
1stFlrSF                0.103177
2ndFlrSF                0.101840
FullBath                0.091148
Neighborhood_StoneBr    0.087765
MSSubClass_30          -0.081856
Exterior1st_BrkFace     0.078220
dtype: float64
## Doubling the optimal value of alpha in Lasso
lasso2= Lasso(alpha= .0012, random_state=42)  # optimal alpha for Lasso was 0.0006; doubled to 0.0012
lasso2.fit(X_train, y_train)
Lasso(alpha=0.0012, random_state=42)
# Top 10 features with double the value of optimal alpha in Lasso
lasso_coef2= pd.Series(lasso2.coef_, index= X_train.columns)
top10_lasso2= lasso_coef2[abs(lasso_coef2[lasso_coef2 != 0]).nlargest(10).index]
top10_lasso2
GrLivArea               0.368980
OverallQual             0.364970
GarageArea              0.148863
OverallCond             0.130027
Neighborhood_StoneBr    0.086031
Exterior1st_BrkFace     0.081740
Neighborhood_NridgHt    0.080808
MSSubClass_30          -0.080712
YearRemodAdd            0.076909
CentralAir              0.076685
dtype: float64
# Checking top 5 features in our lasso model
top25_features_lasso.nlargest()
GrLivArea               0.377281
OverallQual             0.309806
OverallCond             0.144188
Neighborhood_StoneBr    0.136858
GarageArea              0.134759
dtype: float64
As Neighborhood_StoneBr is a dummy variable, we'll drop the entire Neighborhood feature (all of its dummy variables).
# Collecting all Neighborhood dummy variables, plus the other top features, to drop
cols_to_drop= X_train.columns[X_train.columns.str.startswith('Neighborhood')].tolist()
cols_to_drop.extend(['GrLivArea','OverallQual','OverallCond','GarageArea'])
cols_to_drop
['Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker', 'GrLivArea', 'OverallQual', 'OverallCond', 'GarageArea']
# Dropping the above features from X_train and X_test
X_train= X_train.drop(cols_to_drop, axis= 1)
X_test= X_test.drop(cols_to_drop, axis= 1)
X_train.shape, X_test.shape
((1021, 199), (439, 199))
# Building Lasso model with these features
lasso3= Lasso(alpha= .0006, random_state= 42)
lasso3.fit(X_train, y_train)
Lasso(alpha=0.0006, random_state=42)
# Top 5 features after dropping the top 5 features of the previous Lasso model
lasso_coef3= pd.Series(lasso3.coef_, index= X_train.columns)
top5_lasso3= lasso_coef3[abs(lasso_coef3[lasso_coef3 != 0]).nlargest().index]
top5_lasso3
1stFlrSF                  0.402011
2ndFlrSF                  0.369645
GarageType_Not Present   -0.136456
KitchenQual_TA           -0.132718
Exterior1st_BrkFace       0.130630
dtype: float64